(54) Method and apparatus for reliable disk fencing in a multicomputer system



(57) A method and apparatus for fast and reliable
fencing of resources such as shared disks on a net- 
worked system. For each new configuration of nodes 
and resources on the system, a membership program 
module generates a new membership list and, based 
upon that, a new epoch number uniquely identifying the 
membership correlated with the time that it exists. A con- 
trol key based upon the epoch number is generated, and 
is stored at each resource controller and node on the 
system. If a node is identified as failed, it is removed
from the membership list, and a new epoch number and
control key are generated. When a node sends an access
request to a resource, the resource controller compares
its locally stored control key with the control key
stored at the node (which is transmitted with the access
request). The access request is executed only if the two
keys match. The membership list is revised based upon
a node's determination (by some predetermined criterion
or criteria, such as slow response time) of the failure
of a resource, and is carried out independently of any
action (either hardware or software) of the failed
resource.

Description

The present invention relates to a system for relia- 
ble disk fencing of shared disks in a multicomputer sys- 
tem, e.g. a cluster, wherein multiple computers (nodes) 
have concurrent access to the shared disks. In particu- 
lar, the system is directed to a high availability system 
with shared access disks. 

Background of the Invention 

In clustered computer systems, a given node may
"fail", i.e. be unavailable according to some predefined
criteria which are followed by the other nodes. Typically, 
for instance, the given node may have failed to respond 
to a request in less than some predetermined amount 
of time. Thus, a node that is executing unusually slowly 
may be considered to have failed, and the other nodes 
will respond accordingly. 

When a node (or more than one node) fails, the remaining
nodes must perform a system reconfiguration
to remove the failed node(s) from the system, and the 
remaining nodes preferably then provide the services 
that the failed node(s) had been providing. 

It is important to isolate the failed node from any 
shared disks as quickly as possible. Otherwise, if the 
failed (or slowly executing) node is not isolated by the 
time system reconfiguration is complete, then it could, 
e.g., continue to make read and write requests to the 
shared disks, thereby corrupting data on the shared 
disks. 

Disk fencing protocols have been developed to address
this type of problem. For instance, in the VAXcluster
system, a "deadman brake" mechanism is used.
See Davis, R.J., VAXcluster Principles (Digital Press
1993), incorporated herein by reference. In the VAX- 
cluster system, a failed node is isolated from the new 
configuration, and the nodes in the new configuration 
are required to wait a certain predetermined timeout pe- 
riod before they are allowed to access the disks. The 
deadman brake mechanism on the isolated node guar- 
antees that the isolated node becomes "idle" by the end 
of the timeout period. 

The deadman brake mechanism on the isolated 
node in the VAXcluster system involves both hardware 
and software. The software on the isolated node is re- 
quired to periodically tell the cluster Interconnect adap- 
tor (CI), which is coupled between the shared disks and 
the cluster interconnect, that the node is "sane". The 
software can detect in a bounded time that the node is 
not a part of the new configuration. If this condition is 
detected, the software will block any disk I/O, thus set- 
ting up a software "fence" preventing any access of the 
shared disks by the failed node. A disadvantage pre- 
sented by the software fence is that the software must 
be reliable; failure of (or a bug in) the "fence" software 
results in failure to block access of the shared disks by 
the ostensibly isolated node. 



If the software executes too slowly and thus does
not set up the software fence in a timely fashion, the CI
hardware shuts off the node from the interconnect,
thereby setting up a hardware fence, i.e. a hardware
obstacle disallowing the failed node from accessing the
shared disks. This hardware fence is implemented
through a sanity timer on the CI host adaptor. The
software must periodically tell the CI hardware that the
software is "sane". A failure to do so within a certain
time-out period will trigger the sanity timer in CI. This is the
"deadman brake" mechanism.

Other disadvantages of this node isolation system
are that:

• it requires an interconnect adaptor utilizing an internal
timer to implement the hardware fence.

• the solution does not work if the interconnect between
the nodes and disks includes switches or any
other buffering devices. A disk request from an isolated
node could otherwise be delayed by such a
switch or buffer, and sent to the disk after the new
configuration is already accessing the disks. Such
a delayed request would corrupt files or databases.

• depending on the various time-out values, the time
that the members of the new configuration have to
wait before they can access the disk may be too
long, resulting in decreased performance of the entire
system and contrary to high-availability principles.

From an architectural level perspective, a serious
disadvantage of the foregoing node isolation methodology
is that it does not have end-to-end properties; the
fence is set up on the node rather than on the disk
controller.

It would be advantageous to have a system that presented
high availability while rapidly setting up isolation
of failed disks at the disk controller.

Other UNIX-based clustered systems use SCSI
(small computer systems interface) "disk reservation" to
prevent undesired subsets of clustered nodes from accessing
shared disks. See, e.g., the ANSI SCSI-2 Proposed
Standard for information systems (March 9,
1990, distributed by Global Engineering Documents),
which is incorporated herein by reference. Disk reservation
has a number of disadvantages; for instance, the
disk reservation protocol is applicable only to systems
having two nodes, since only one node can reserve a
disk at a time (i.e. no other nodes can access that disk
at the same time). Another is that in a SCSI system, the
SCSI bus reset operation removes any disk reservations,
and it is possible for the software disk drivers to
issue a SCSI bus reset at any time. Therefore, SCSI disk
reservation is not a reliable disk fencing technique.

Another node isolation methodology involves a
"poison pill"; when a node is removed from the system
during reconfiguration, one of the remaining nodes
sends a "poison pill", i.e. a request to shut down, to the
failed node. If the failed node is in an active state (e.g.
executing slowly), it takes the pill and becomes idle within
some predetermined time.

The poison pill is processed either by the host adap- 
tor card of the failed node, or by an interrupt handler on 
the failed node. If it is processed by the host adaptor 
card, the disadvantage is presented that the system re- 
quires a specially designed host adaptor card to imple- 
ment the methodology. If it is processed by an interrupt 
handler on the failed node, there is the disadvantage 
that the node isolation is not reliable; for instance, as 
with the VAXcluster discussed above, the software at 
the node may itself be unreliable, time-out delays are
presented, and again the isolation is at the node rather 
than at the shared disks. 

A system is therefore needed that prevents shared 
disk access at the disk sites, using a mechanism that 
both rapidly and reliably blocks an isolated node from 
accessing the shared disks, and does not rely upon the 
isolated node itself to support the disk access preven- 
tion. 

Summary of the Invention 

The present invention utilizes a method and apparatus
for quickly and reliably isolating failed resources,
including I/O devices such as shared disks, and is applicable
to virtually any shared resource on a computer
system or network. The system of the invention maintains
a membership list of all the active shared resources,
and with each new configuration, such as when a
resource is added or fails (and thus should be functionally
removed), the system generates a new epoch
number or other value that uniquely identifies that configuration
at that time. Thus, identical memberships occurring
at different times will have different epoch numbers,
particularly if a different membership set has occurred
in between.

Each time a new epoch number is generated, a con- 
trol key value is derived from it and is sent to the nodes 
in the system, each of which stores the control key lo- 
cally as its own node key. The controllers for the resourc- 
es (such as disk controllers) also store the control key 
locally. Thereafter, whenever a shared resource access 
request is sent to a resource controller, the node key is 
sent with it. The controller then checks whether the node 
key matches the controller's stored version of the control 
key, and allows the resource access request only if the 
two keys match. 

When a resource fails, e.g. does not respond to a
request within some predetermined period of time (indicating
a possible hardware or software defect), the
membership of the system is determined anew, eliminating
the failed resource. A new epoch number is generated,
and therefrom a new control key is generated
and is transmitted to all the resource controllers and
nodes on the system. If an access request arrives at a
resource controller after the new control key is generated,
the access request will bear a node key that is different
from the current control key, and thus the request
will not be executed. This, coupled with preventing
nodes from issuing access requests to resources that
are not in the current membership set, ensures that
failed resources are quickly eliminated from access, by
requiring that all node requests, in order to be processed,
have current control key (and hence membership)
information.

The nodes each store program modules to carry out
the functions of the invention, e.g. a disk (or resource)
manager module, a distributed lock manager module,
and a membership module. The distribution of these
modules allows any node to identify a resource as failed
and to communicate that to the other nodes, and to generate
new membership lists, epoch numbers and control
keys.

The foregoing system therefore does not rely upon
the functioning of a failed resource's hardware or soft- 
ware, and provides fast end-to-end (i.e. at the resource) 
resource fencing. 



Figure 1 is a top-level block diagram showing several
nodes provided with access to a set of shared disks.

Figure 2 is a more detailed block diagram of a system
similar to that of Figure 1, but showing elements of
the system of the invention that interact to achieve disk
fencing.

Figure 3 is a diagram illustrating elements of the
structure of each node of Figure 1 or Figure 2 before
and after reconfiguration upon the unavailability of node
D.

Figure 4 is a block diagram of a system of the in- 
vention wherein the nodes access more than one set of 
shared disks. 

Figure 5 is a flow chart illustrating the method of the 
invention. 

Description of the Preferred Embodiments 

The system of the invention is applicable generally
to clustered systems, such as system 10 shown in Figure
1, including multiple nodes 20-40 (Nodes 1-3 in this
example) and one or more sets of shared disks 50. Each
of nodes 20-40 may be a conventional processor-based
system having one or more processors and including
memory, mass storage, and user I/O devices (such as
monitors, keyboards, mouse, etc.), and other conventional
computer system elements (not all shown in Figure
1), and configured for operation in a clustered
environment.

Disks 50 will be accessed and controlled via a disk
controller 60, which may include conventional disk controller
hardware and software, and includes a processor
and memory (not separately shown) for carrying out disk
control functions, in addition to the features described
below.

The system of the invention may in general be im- 
plemented by software modules stored in the memories 
of the nodes 20-40 and of the disk controller. The soft- 
ware modules may be constructed by conventional soft- 
ware engineering, given the following teaching of suita- 
ble elements for implementing the disk fencing system 
of the invention. Thus, in general in the course of the 
following description, each described function may be 
implemented by a separate program module stored at 
a node and/or at a resource (e.g. disk) controller as ap- 
propriate, or several such functions may be implement- 
ed effectively by a single multipurpose module. 

Figure 2 illustrates in greater detail a clustered sys- 
tem 70 implementing the invention. The system 70 in- 
cludes four nodes 80-110 (Nodes A-D) and at least one 
shared disk system 120. The nodes 80-110 may be any 
conventional cluster nodes (such as workstations, per- 
sonal computers or other processor-based systems like 
nodes 20-40 or any other appropriate cluster nodes), 
and the disk system may be any appropriate shared disk 
assembly, including a disk system 50 as discussed in 
connection with Figure 1.

Each node 80-110 includes at least the following 
software modules: disk manager (DM), an optional dis- 
tributed lock manager (DLM), and membership monitor 
(MM). These modules may be for the most part conven- 
tional as in the art of clustered computing, with modifi- 
cations as desired to implement the features of the 
present invention. The four MM modules MMA-MMD
are connected in communication with one another as 
illustrated in Figure 2, and each of the disk manager 
modules DMA-DMD is coupled to the disk controller (not 
separately shown) of the disk system 120. 

Nodes in a conventional clustered system partici- 
pate in a "membership protocol", such as that described 
in the VAXcluster Principles cited above. The member- 
ship protocol is used to establish an agreement on the 
set of nodes that form a new configuration when a given 
node is dropped due to a perceived failure. Use of the 
membership protocol results in an output including (a) 
a subset of nodes that are considered to be the current 
members of the system, and (b) an "epoch number" 
(EN) reflecting the current status of the system. Alterna- 
tives to the EN include any time or status value uniquely 
reflecting the status of the system for a given time. Such 
a membership protocol may be used in the present sys- 
tem. 

According to the membership protocol, whenever the
membership set changes a new unique epoch number
is generated and is associated with the new membership
set. For example, if a system begins with a membership
of four nodes A-D (as in Figure 2), and an epoch
number 100 has been assigned to the current configuration,
this may be represented as <A, B, C, D; #100>
or <MEM=A, B, C, D; EN=100>, where MEM stands for
"membership". This is the configuration represented in
Figure 3(a), where all four nodes are active, participating
nodes in the cluster.

If node D crashes or is detected as malfunctioning,
the new membership becomes <MEM=A, B, C;
EN=101>; that is, node D is eliminated from the membership
list and the epoch number is incremented to
101, indicating that the epoch wherein D was most recently
a member is over. While all the nodes that participate
in the new membership store the new membership
list and new epoch number, failed node D (and any
other failed node) maintains the old membership list and
the old epoch number. This is as illustrated in Figure 3(b),
wherein the memories of nodes A-C all store
<MEM=A, B, C; EN=101>, while failed and isolated
node D stores <MEM=A, B, C, D; EN=100>.
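The transition from Figure 3(a) to Figure 3(b) can be sketched as follows. This is only an illustration: the names `Configuration` and `remove_failed` are hypothetical, and incrementing the epoch number by one is an assumption (the text requires only that each new configuration receive a new, unique epoch number).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Configuration:
    """One <MEM; EN> pair as output by a membership protocol."""
    members: frozenset   # MEM: names of the active nodes
    epoch: int           # EN: epoch number for this membership


def remove_failed(config, failed):
    """Return the next configuration after dropping the failed nodes.

    Incrementing by one is an assumption made for illustration.
    """
    return Configuration(config.members - frozenset(failed), config.epoch + 1)


# Figure 3(a): all four nodes are active.
before = Configuration(frozenset("ABCD"), 100)

# Figure 3(b): node D is removed. Nodes A-C store the new pair,
# while the isolated node D still holds the old one.
after = remove_failed(before, {"D"})
view = {name: after for name in "ABC"}
view["D"] = before
```

The stale pair retained by node D is the information the fencing scheme exploits.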

The present invention utilizes this fact, i.e.
that the current information is stored by active nodes
while outdated information is stored by the isolated
node(s), to achieve disk fencing. This is done by utilizing
the value of a "control key" (CK) variable stored by the
nodes and the shared disk system's controller (e.g. in
volatile memory of the disk controller).

Figure 4 is a block diagram of a four-node clustered
system 400 including nodes 410-440 and two shared
disk systems 450-460 including disks 452-456 (system
450) and 462-466 (system 460). Disk systems 450 and
460 are controlled, respectively, by disk controllers 470
and 480 coupled between the respective disk controllers
and a cluster interconnect 490.

The nodes 410-440 may be processor-based systems
as described above, and the disk controllers are
also as described above, and thus the nodes, shared
disk systems (with controllers) and cluster interconnect
may be conventional in the art, with the addition of the
features described herein.

Each node stores both a "node key" (NK) variable
and the membership information. The NK value is calculated
from the current membership by one of several
alternative functions, described below as Methods 1-3.
Figure 4 shows the generalized situation, taking into account
the possibility that any of the nodes may have a
different CK number than the rest, if that node has failed
and been excluded from the membership set.

As a rule, however, when all nodes are active, their
respective stored values of NK and the value of CK
stored at the disk controllers will all be equal.

Node/Disk Controller Operations Using Node Key and 
Control Key Values 

Each read and write request by a node for accessing
a disk controller includes the NK value; that is, whenever
a node requests read or write access to a shared
disk, the NK value is passed as part of the request. This
inclusion of the NK value in read and write requests thus
constitutes part of the protocol between the nodes and
the controller(s).

The protocol between the nodes and disk controller
also includes two operations to manipulate the CK value
on the controller: GetKey to read the current CK value,
and SetKey to set the value of CK to a new value.
GetKey does not need to provide an NK value, a CK
value, or an EN value, while the SetKey protocol uses
the NK value as an input and additionally provides a new
CK value "new.CK" to be adopted by the controller.

The four foregoing requests and their input/output 
arguments may be represented and summarized as fol- 
lows: 

Read(NK, ...) 
Write(NK, ...) 
GetKey(...) 
SetKey(NK, new.CK)

The GetKey(...) operation returns the current value
of CK. This operation is never rejected by the controller. 

The SetKey(NK, new.CK) operation first checks if
the NK field in the request matches the current CK value 
in the controller. In the case of a match, the CK value in 
the controller is set equal to the value in the "new.CK" 
field (in the SetKey request). If NK from the requesting 
node doesn't match the current CK value stored at the 
controller, the operation is rejected and the requesting 
node is sent an error indication. 

The Read(NK, ...) and Write(NK, ...) operations are
allowed to access the disk only if the NK field in the packet
matches the current value of CK. Otherwise, the operation
is rejected by the controller and the requesting
node is sent an error indication.

When a controller is started, the CK value is prefer- 
ably initialized to 0. 
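The four controller operations described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the controller implementation itself: the class name `DiskController`, the exception `KeyMismatch` (standing in for the "error indication"), and the dictionary used as disk storage are all hypothetical.

```python
class KeyMismatch(Exception):
    """Error indication returned to a node whose NK does not match CK."""


class DiskController:
    def __init__(self):
        self.ck = 0        # CK is preferably initialized to 0 at start-up
        self.blocks = {}   # stand-in for the shared disk contents

    def get_key(self):
        # GetKey(...): takes no key argument and is never rejected.
        return self.ck

    def set_key(self, nk, new_ck):
        # SetKey(NK, new.CK): accepted only if NK matches the current CK.
        self._check(nk)
        self.ck = new_ck

    def read(self, nk, block):
        # Read(NK, ...): executed only if NK matches the current CK.
        self._check(nk)
        return self.blocks.get(block)

    def write(self, nk, block, data):
        # Write(NK, ...): executed only if NK matches the current CK.
        self._check(nk)
        self.blocks[block] = data

    def _check(self, nk):
        if nk != self.ck:
            raise KeyMismatch(f"NK {nk} does not match the current CK")
```

Routing every key-bearing request through one comparison keeps the fence in a single place on the controller, which is the end-to-end property the description emphasizes.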

Procedure Upon Failure of a Node 

When the membership changes because one or 
more failed nodes are being removed from the system, 
the remaining nodes calculate a new value of CK from 
the new membership information (in a manner to be de- 
scribed below). One of the nodes communicates the 
new CK value to the disk controller using the SetKey 
(NK, new.CK) operation. After the new CK value is set, 
all member (active) nodes of the new configuration set 
their NK value to this new CK value. 

If a node is not a part of the new configuration (e.g. 
a failed node), it is not allowed to change its NK. If such 
a node attempts to read or write to a disk, the controller 
finds a mismatch between the new CK value and the old 
NK value. 

When a node is started, its NK is initialized to a 0 
value. 
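The reconfiguration procedure can be illustrated with a small simulation. The sketch below makes several assumptions: the `Controller` and `Node` classes and the `reconfigure` helper are hypothetical, the first listed member is arbitrarily chosen to issue SetKey, and the CK value is simply taken to be the epoch number.

```python
class Controller:
    """Minimal stand-in for a disk controller holding the control key CK."""

    def __init__(self):
        self.ck = 0

    def set_key(self, nk, new_ck):
        if nk != self.ck:
            return False       # SetKey rejected
        self.ck = new_ck
        return True

    def write(self, nk):
        return nk == self.ck   # True means the write would be executed


class Node:
    def __init__(self, name):
        self.name = name
        self.nk = 0            # NK is initialized to 0 when the node starts


def reconfigure(controller, members, new_ck):
    """One member sets the new CK; then every member adopts it as its NK."""
    leader = members[0]        # arbitrary choice of the node issuing SetKey
    if not controller.set_key(leader.nk, new_ck):
        raise RuntimeError("SetKey rejected: leader holds a stale NK")
    for node in members:
        node.nk = new_ck


a, b, c, d = (Node(name) for name in "ABCD")
ctrl = Controller()
reconfigure(ctrl, [a, b, c, d], 100)   # initial configuration, CK = 100
reconfigure(ctrl, [a, b, c], 101)      # D has failed and is left out
```

After the second call, node D still carries NK = 100, so any write it issues is refused by the controller without any cooperation from D itself.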

Procedures for Calculating Values of the Control Key 
(CK) 

The control key CK may be set in a number of different
ways. The selected calculation will be reflected in
a software or firmware module stored and/or mounted
at least at the controller. In general, the calculation of
the CK value should take into account the membership
information:

CK = func(MEM, EN)

where: MEM includes information about the active
membership list;
and EN is the epoch number.

Method 1. Ideally, the CK value would explicitly include
both a list of the new membership set (an encoded
set of nodes) and the epoch number. This may not be
desired if the number of nodes is high, however, because
the value of CK would have to include at least a
bit of information for each node. That is, in a four-node
configuration at least a four-bit sequence BBBB (where
B = 0 or 1) would need to be used, each bit B indicating
whether a given associated node is active or inactive
(failed). In addition, several bits are necessary for the
epoch number EN, so the total length of the variable CK
may be quite long.

Methods 2 and 3 below are designed to compress
the membership information when calculating the CK
value.
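One possible encoding for Method 1 is sketched below. The fixed node ordering `NODES`, the 16-bit epoch field `EPOCH_BITS`, and the function names are assumptions chosen for illustration; the text specifies only that CK carries one bit per node plus the epoch number.

```python
NODES = ["A", "B", "C", "D"]   # fixed node ordering, assumed known to all nodes
EPOCH_BITS = 16                # assumed width of the epoch field


def encode_ck(members, epoch):
    """Pack a membership bitmap and the epoch number into one integer."""
    if not 0 <= epoch < (1 << EPOCH_BITS):
        raise ValueError("epoch does not fit in EPOCH_BITS")
    bitmap = 0
    for i, name in enumerate(NODES):
        if name in members:
            bitmap |= 1 << i   # bit i is 1 when node i is active
    return (bitmap << EPOCH_BITS) | epoch


def decode_ck(ck):
    """Recover (members, epoch) from a key built by encode_ck."""
    epoch = ck & ((1 << EPOCH_BITS) - 1)
    bitmap = ck >> EPOCH_BITS
    members = {name for i, name in enumerate(NODES) if bitmap & (1 << i)}
    return members, epoch
```

The key grows by one bit per node, which is the length concern raised for large clusters.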

Method 2 uses only the epoch number EN and ignores
the membership list MEM. For example, the CK
value is set to equal the epoch number EN.

Method 2 is most practical if the membership protocol
prevents network partitioning (e.g. by majority quorum
voting). If membership partitioning is allowed, e.g.
in the case of a hardware failure, the use of the CK value
without reflecting the actual membership of the cluster
could lead to conflicts between the nodes on either side
of the partition.

Method 3 solves the challenge of Method 2 with respect
to partitions. In this method, the CK value is encoded
with an identification of the highest node in the
new configuration. For example, the CK value may be
a concatenation of a node identifier (a number assigned
to the highest node) and the epoch number. This method
provides safe disk fencing even if the membership monitor
itself does not prevent network partitioning, since the
number of the highest node in a given partition will be
different from that of another partition; hence, there cannot
be a conflict between requests from nodes in different
partitions, even if the EN's for the different subclusters
happen to be the same.
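The difference between Methods 2 and 3 under a network partition can be shown with a short sketch. It assumes integer node identifiers, interprets "highest node" as the largest identifier, and uses a hypothetical 16-bit epoch field for the concatenation; none of these details is fixed by the text.

```python
EPOCH_BITS = 16   # assumed width of the epoch field


def ck_method2(members, epoch):
    """Method 2: the control key is simply the epoch number."""
    return epoch


def ck_method3(members, epoch):
    """Method 3: highest node identifier concatenated with the epoch number."""
    return (max(members) << EPOCH_BITS) | epoch


# A four-node cluster (node identifiers 1-4) splits into two partitions
# that each independently advance to epoch 101.
left, right = {1, 2}, {3, 4}

same_under_method2 = ck_method2(left, 101) == ck_method2(right, 101)
same_under_method3 = ck_method3(left, 101) == ck_method3(right, 101)
```

Under Method 2 both partitions derive the same key, so a controller cannot tell them apart; under Method 3 the keys differ because the partitions are disjoint and therefore have different highest nodes.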

Of the foregoing, with a small number of nodes
Method 1 is preferred, since it contains the most explicit
information on the state of the clustered system. However,
with numerous nodes Method 3 becomes preferable.
If the system prevents network partitioning, then
Method 2 is suitable.

The Method of the Invention

Given the foregoing structures and functions, and
appropriate modules to implement them, the disk fencing
system of the invention is achieved by following the
method 510 illustrated in the flow chart of Figure 5. At
box (step) 520, the membership of the clustered system
is determined in a conventional manner, and the value
of the membership set (or list) is stored as the value of
MEM. An epoch number EN (or other unique state identifier)
is generated at box 530. These two functions are
carried out by the membership monitor (MM) module,
which is implemented among the member nodes to determine
which nodes are present in the system and then
to assign a value of EN to that configuration. An example
of a system that uses an MM module in this way is applicant
Sun Microsystems, Inc.'s SparcCluster PDB
(parallel database).

In current systems, the epoch numbers are used so
that a node can determine whether a given message or
data packet is stale; if the epoch number is out of date
then the message is known to have been created
during an older, different configuration of the cluster.
(See, for instance, T. Mann et al., "An Algorithm for Data
Replication", DEC SRC Research Report, June 1989,
incorporated herein by reference, wherein epoch numbers
are described as being used in stamping file replicas
in a distributed system.)
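The conventional staleness check mentioned here amounts to comparing the epoch stamped on a message with the receiver's current epoch. A minimal sketch, assuming a hypothetical `Message` type and that a lower epoch number means an older configuration:

```python
from dataclasses import dataclass


@dataclass
class Message:
    epoch: int     # epoch number in force when the message was created
    payload: str


def is_stale(msg, current_epoch):
    """A message stamped with an older epoch predates the current configuration."""
    return msg.epoch < current_epoch
```

This check runs at the receiving node; the present system instead places the comparison at the resource controller.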

The present system uses the epoch number in an
entirely new way, which is unrelated to prior systems'
usage of the epoch number. For an example of a preferred
manner of using a cluster membership monitor in
Sun Microsystems, Inc.'s systems, see Appendix A attached
hereto, in which the reconfiguration sequence
numbers are analogous to epoch numbers. Thus, the
distinct advantage is presented that the current invention
solves a long-standing problem, that of quickly and
reliably eliminating failed nodes from a cluster membership
and preventing them from continuing to access
shared disks, without requiring new procedures to generate
new outputs to control the process; rather, the
types of information that are already generated may be
used in conjunction with modules according to the invention
to accomplish the desired functions, resulting in
a reliable high-availability system.

Proceeding to box 540, the node key NK (for active 
nodes) and control key CK are generated by one of the 
Methods 1-3 described above or by another suitable 
method. 

At box 550, it is determined whether a node has become
unavailable. This step is carried out virtually continuously
(or at least with relatively high frequency, e.g.
higher than the frequency of I/O requests); for instance,
at almost any time a given node may determine that another
node has exceeded the allowable time to respond
to a request, and decide that the latter node has failed
and should be removed from the cluster's membership
set. Thus, the step in box 550 may take place almost
anywhere during the execution of the method.

Box 560 represents an event where one of the
nodes connected to the cluster generates an I/O request
(such as a disk access request). If so, then at box 570
the current value of NK from the requesting node is sent
with the I/O access request, and at box 580 it is determined
whether this matches the value of CK stored by
the controller. If not, the method proceeds to step 600,
where the request is rejected (which may mean merely
dropped by the controller with no action), and proceeds
then back to box 520.

If the node's NK value matches the controller's CK 
value, then the request is carried out at box 590. 

If a node has failed, then the method proceeds from 
box 550 back to box 520, where the failed node is elim- 
inated in a conventional fashion from the membership 
set, and thus the value of MEM changes to reflect this. 
At this time, a new epoch number EN is generated (at 
box 530) and stored, to reflect the newly revised mem- 
bership list. In addition, at box 540 a new control key 
value CK is generated, the active nodes' NK values take 
on the value of the new CK value, and the method pro- 
ceeds again to boxes 550-560 for further disk accesses. 
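The flow of Figure 5 can be traced with a compact event loop. The sketch below is an illustration under assumptions: events are given as a pre-built list, CK is derived by Method 2 (CK equal to EN), the epoch number is incremented by one per reconfiguration, and the function name `run` is hypothetical. Box numbers in the comments refer to the steps described above.

```python
def run(events, members, epoch=100):
    """Process ("fail", node) and ("io", node) events; return the I/O outcomes."""
    members = set(members)             # box 520: determine MEM
    ck = epoch                         # boxes 530-540: EN, then CK (Method 2 assumed)
    nk = {n: ck for n in members}      # active nodes adopt CK as their NK
    outcomes = []
    for kind, node in events:
        if kind == "fail" and node in members:   # box 550: node became unavailable
            members.discard(node)      # box 520: revise MEM
            epoch += 1                 # box 530: new EN
            ck = epoch                 # box 540: new CK ...
            for n in members:
                nk[n] = ck             # ... adopted by the remaining nodes only
        elif kind == "io":             # box 560: I/O request issued
            sent = nk.get(node, 0)     # box 570: NK travels with the request
            if sent == ck:             # box 580: compare with the controller's CK
                outcomes.append((node, "executed"))   # box 590
            else:
                outcomes.append((node, "rejected"))   # box 600
    return outcomes


result = run([("io", "D"), ("fail", "D"), ("io", "D"), ("io", "A")], "ABCD")
```

In this trace node D's first request is executed, its request after removal is rejected because it still carries the old key, and node A continues to be served.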

It will be seen from the foregoing that the failure of 
a given node in a clustered system results both in the 
removal of that node from the cluster membership and, 
importantly, the reliable prevention of any further disk 
accesses to shared disks by the failed node. The inval- 
idating of the failed node from shared disk accesses 
does not rely upon either hardware or software of the 
failed node to operate properly, but rather is entirely
independent of the failed node.

Since the CK values are stored at the disk control- 
lers and are used by an access control module to pre- 
vent failed nodes from gaining shared disk access, the 
disk fencing system of the invention is as reliable as the 
disk management software itself. Thus, the clustered 
system can rapidly and reliably eliminate the failed node 
with minimal risk of compromising the integrity of data 
stored on its shared disks. 

The described invention has the important advan- 
tage over prior systems that its end-to-end properties 
make it independent of disk interconnect network or bus 
configuration; thus, the node configuration alone is tak- 
en into account in determining the epoch number or oth- 
er unique status value, i.e. independent of any low-level 
mechanisms (such as transport mechanisms). 

Note that the system of the invention may be ap- 
plied to other peripheral devices accessed by multiple 
nodes in a multiprocessor system. For instance, other 
I/O or memory devices may be substituted in place of 
the shared disks discussed above; a controller corre- 
sponding to the disk controllers 470 and 480 would be 
used, and equipped with software modules to carry out 
the fencing operation. 

In addition, the nodes, i.e. processor-based systems,
that are members of the cluster can be any of a
variety of processor-based devices, and in particular
need not specifically be personal computers or workstations,
but may be other processor-driven devices capable
of issuing access requests to peripheral devices
such as shared disks.



Claims 

1. A method for preventing access to a shared peripheral
device by a processor-based node in a multinode
system, including the steps of:

(1) storing at the peripheral device a first unique
value representing a first configuration of the
multinode system;

(2) sending an access request from the node to
the device, the request including a second
unique value representing a second configuration
of the multinode system;

(3) determining whether said first and second
values are identical; and

(4) if the first and second values are identical,
then executing the access request at the peripheral
device.

2. The method of claim 1, wherein:

said first value is generated utilizing at least in
part information relating to a first time when the
multinode system was in said first configuration;
and

said second value is generated utilizing at least
in part information relating to a second time
when the multinode system was in said second
configuration.

3. The method of claim 2, wherein: 

step 3 includes the step of determining wheth- 
er said first and second times are identical. 

4. The method of claim 1, wherein said first and second
values are generated based at least in part on
epoch numbers generated by a membership protocol
executing on said multinode system.

5. The method of claim 4, wherein each of said first 
and second values is generated based at least in 
part on respective membership sets of said multin- 
ode system generated by said membership proto- 
col. 

6. The method of claim 1, wherein each of said first 
and second values is generated based at least in 
part on respective membership sets of said multin- 
ode system generated by said membership proto- 
col. 

7. An apparatus for preventing access to at least one
shared peripheral resource by a processor-based
node in a multinode system, the resource being
coupled to the system by a resource controller including
a controller memory, each of a plurality of
nodes on the system including a processor coupled
to a node memory storing program modules configured
to execute functions of the invention, the apparatus
including:

a membership monitor module configured to
determine a membership list of the nodes, including
said resource, on the system at predetermined
times, including at least at a time
when the membership of the system changes;

a resource manager module configured to determine
when the resource is in a failed state
and for communicating the failure of the resource
to said membership monitor to indicate
to the membership monitor to generate a new
membership list;

a configuration value module configured to
generate a unique value based upon said new
membership list and to store said unique value
locally at each node on the system; and

an access control module stored at said controller
memory configured to block access requests
by at least one said requesting node to
said resource when the locally stored unique
value at said requesting node does not equal
the unique value stored at said resource controller.

8. The apparatus of claim 7, wherein said configuration
value monitor module is configured to determine
said unique value based at least in part upon
a time stamp indicating the time at which the corresponding
membership list was generated.

9. The apparatus of claim 7, wherein said unique value
is based at least in part upon an epoch number generated
by a membership protocol module.

10. The apparatus of claim 7, wherein said membership
monitor module is configured to execute independently
of any action by said shared resource when
said shared resource is in a failed state.

11. The apparatus of claim 7, wherein said resource
manager module is configured to execute independently
of any action by said shared resource
when said shared resource is in a failed state.

12. The apparatus of claim 7, wherein said configuration
module is configured to execute independently
of any action by said shared resource when said
shared resource is in a failed state.

13. The apparatus of claim 7, wherein said access control
module is configured to execute independently
of any action by said shared resource when said
shared resource is in a failed state.